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Abstract 

We study learning problems in which the conditional distribution of the output given the input 
varies as a function of additional task variables. In varying-coefficient models with Gaussian pro¬ 
cess priors, a Gaussian process generates the functional relationship between the task variables and 
the parameters of this conditional. Varying-coefficient models subsume hierarchical Bayesian mul¬ 
titask models, but also generalizations in which the conditional varies continuously, for instance, in 
time or space. However, Bayesian inference in varying-coefficient models is generally intractable. 
We show that inference for varying-coefficient models with isotropic Gaussian process priors re¬ 
solves to standard inference for a Gaussian process that can be solved efficiently. MAP inference 
in this model resolves to multitask learning using task and instance kernels, and inference for hi¬ 
erarchical Bayesian multitask models can be carried out efficiently using graph-Laplacian kernels. 
We report on experiments for geospatial prediction. 


1. Introduction 

In standard settings of learning from independent and identically distributed (iid) data, labels y 
of training and test instances x are drawn independently and are governed by a fixed conditional 
distribution p(y |x). A great variety of problem settings relax this assumption; they are widely 
referred to as transfer learning. We study a general transfer learning setting in which the conditional 
p(y |x) is assumed to vary as a function of additional observable variables t. The variables t can 
identify a specific domain that an observation was drawn from (as in multitask learning), or can be 
continuous attributes that describe, for instance, the time or location at which an observation was 
made (sometimes called concept drift). 
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A natural model for this setting is to assume a conditional p(y|x; w) with parameters w that 
vary with t. Such models are known as varying-coefficient models (e.g., Hastie and Tibshirani, 
1993; Gelfand et al., 2003). In iid learning, it is common to assume an isotropic Gaussian prior 
p( w) over model parameters. When the parameters vary as a function of a task variable t, it is 
natural to instead assume a Gaussian process (GP) prior over functions that map values of t to 
values of w. A Gaussian process implements a prior p(u>) over functions u> \'T —? M m that couple 
parameters w e M m for different values of t € T and make it possible to generalize over different 
domains, time, or space. While this model allows to extend Bayesian inference naturally to a variety 
of transfer learning problems, inference in these varying-coefficient models for large problems is 
often impractical: It involves Kronecker products that result in matrices of size nm x nm, with n 
the number of instances and m the number of attributes (Gelfand et ah, 2003; Wheeler and Calder, 
2006). 

Alternatively, varying-coefficient models can be derived in a regularized risk minimization 
framework. Such models infer point estimates of parameters w for different observed values of 
t under some model that expresses how w changes smoothly with t (Fan and Zhang, 2008). At test 
time, point estimates of w arc required for all t observed at the test data points. This is again com¬ 
putationally challenging because typically a separate optimization problem needs to be solved for 
each test instance. Most prominent arc estimation techniques based on kernel-local smoothing (Fan 
and Zhang, 2008; Wu and Chiang, 2000; Fan and Huang, 2005). 

In this paper, we explore Bayesian varying-coefficient models in conjunction with isotropic 
Gaussian process priors. An isotropic prior encodes the assumption that elements of the vector 
of model parameters arc generated independently of one another; isotropic GP priors arc in direct 
analogy to isotropic Gaussian priors that are widely used in iid learning. Our main theoretical result 
is that Bayesian inference in varying-coefficient models with isotropic Gaussian process priors is 
equal to Bayesian inference in a standard Gaussian process with a specific product kernel. The 
main practical implication of this result is that inference for varying-coefficient models becomes 
practical by using standard GP tools. Our theoretical result also leads to insights regarding existing 
transfer learning methods: First, we identify the exact modeling assumptions under which Bayesian 
inference amounts to multitask learning using a Gaussian process with task kernels and instance 
kernels (Bonilla et al., 2007). Secondly, we show that hierarchical Bayesian multitask models {e.g., 
Gelman et al., 1995; Finkel and Manning, 2009) can be represented as Gaussian process priors; 
inference then resolves to inference in standard Gaussian processes with multitask kernels based on 
graph Laplacians (Evgeniou et al., 2005; Alvarez et al., 2011). 

Our main empirical result is that varying-coefficient models with GP priors arc an effective and 
efficient model for prediction problems in which the conditional distribution of the output given 
the input varies in time and geographical location. In our experiments, varying coefficient models 
outperform reference models for the problems of predicting rents and real-estate prices. 

The paper is structured as follows. Section 2 describes the problem setting and the varying- 
coefficient model. Section 3 studies Bayesian inference and presents our main results. Section 4 
presents experiments on prediction of real estate sales prices and monthly rents; Section 5 discusses 
related work and concludes. 
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Figure 1: Generative process of the varying-coefficient model described in Section 2. Variables 
w* denote the feature vector, label, task variable, and parameterization for a 
novel test instance. 


2. Problem Setting and Model 

This section defines a generative process which models a wide class of applications that arc char¬ 
acterized by a conditional distribution p(y |x, w) whose parameterization w varies as a function of 
additional variables t. Figure 1 shows a plate representation of the model. 

A fixed set of instances xi,... ,x„ with x, G T C M m is observable, along with values 
ti,... ,t n € T of a task variable. The process starts by drawing a function u : T M™ ac¬ 
cording to a prior p(uj). The function u> associates any task variable t G T with a corresponding 
parameter vector u>(t) G M m that defines the conditional distribution p(y |x, w(t)) for task t £ T. 
The domain T of the task variable depends on the application at hand. In the simplest case of multi¬ 
task learning, T = {1,..., k} is a set of task identifiers. In hierarchical Bayesian multitask models, 
a tree Q = (T, A) over the tasks T = {1...., k } reflects how tasks arc related; we represent this 
tree by its adjacency matrix A G Vf xk . We also study the setting of concept drift or non-stationary 
learning in which the conditional distribution of y given x varies smoothly in the task variables t 
that can, for instance, comprise time or space. In this case, T C M (/ is a continuous-valued space. 

We model p(u>) using a zero-mean Gaussian process 

u>~£P(0,k) (1) 

that generates vector-valued functions u : T —> M m . The process is specified by a matrix-valued 
kernel function n : T x T -A M mxm that reflects closeness in T. Here, K(t,t/) G M mxm is the 
matrix of covariances between components of the vectors w(t) and uj(t') for t,t' G T. We assume 
that the kernel function n is isotropic; that is, n(t, t'j = t')I mXm for a positive semidefinite 
kernel function kj- : T x T —> M. This corresponds to the assumption that each dimension of 
the vector-valued function co is generated by an independent Gaussian process, and these Gaussian 
processes share a common kernel function k-p. Note that this decoupling is not an independence 
assumption on attributes; it is instead analogous to the assumption of an isotropic normal prior for 
model parameters that justifies the standard /^-regularization. We use Kt G R nx " to denote the 
matrix given by evaluations kj-(ti, tj) of the kernel function k-p. The process evaluates function ui 
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for all tj to create parameter vectors wj = , w„ = u>(t n ). The process then concludes 

by generating labels y, from an appropriate observation model, 

Vi~p{y |xj,Wj) t (2) 

for instance, a standard linear model with Gaussian noise for regression or a logistic function of the 
inner product of w$ and x, for classification. 

The prediction problem is to infer the distribution of the label y t for a new observation x* 
with task variable t*. For notational convenience, we aggregate the training instances into ma¬ 
trix X E M nxm with row vectors xj,.... xj, the task variables into matrix T <E R nX(l with 
row vectors t|,... ,tj, the parameter vectors associated with training observations into a matrix 
W € M nxm with row vectors wj, ..., wj, and the labels y\,..., y n into vector y G y n . 

In this model, the Gaussian process prior p(co) over functions u : T —> M m couples parameter 
vectors u>(t) for different values t of the task variable. The hierarchical Bayesian model of multitask 
learning assumes a coupling of parameters based on a hierarchical Bayesian prior {e.g., Gelman 
et ah, 1995; Finkel and Manning, 2009). We will now show that the varying-cocflicicnt model with 
isotropic GP prior subsumes hierarchical Bayesian multitask models by choice of an appropriate 
kernel function n of the Gaussian process that defines p(oj). Together with results on inference 
presented in Section 3, this result shows how inference for hierarchical Bayesian multitask models 
can be carried out using a Gaussian process. 

The following definition formalizes the hierarchical Bayesian multitask model. 

Definition 1 (Hierarchical Bayesian Multitask Model) Let Q = (T, A) denote a tree structure 
over a set of tasks T = {1, ..., k} given by an adjacency matrix A, with 1 E T the root node. Let 
a G M /: ' denote a vector with entries o\,, ay.. The following process generates the distribution 
p(y|X, T; Q, a) over labels y E y n given instances X, task variables T, the task hierarchy Q. and 
variances a: The process first samples parameter vectors wi,..., w*. E M m according to 

Wi ~ 2V(w|0, crjlmxm) (3) 

w i ~ AA(w|w pa p), afl mXm ) 2<l <k (4) 

where for l E T, pa(l) E T is the unique node with A pa py = 1; then, the process generates 
labels y l ~ p(y|xj, Wj), where p(y\xi, Wj) is the same conditional distribution over labels given 
an instance and a parameter vector as was chosen for the varying-coefftcient model in Equation 2. 
This process defines the hierarchical Bayesian multitask model. 

The following proposition shows that the varying-coefficient model presented in Section 2 subsumes 
the hierarchical Bayesian multitask model. 

Proposition 2 Let Q = (T, A) denote a tree structure over a set of tasks T = {1,..., k} given by 
an adjacency matrix A. Let a E R k be a vector with entries o\, ..., ak- Let La. a- ■ T x T —> M be 
given by La t') = G t p, where Gij denotes the entry at row i and column j of the matrix 

G = (I fexfc -A)- 1 s(l fcxfc -A T ) _1 , 

and S E R fcxfc denotes the diagonal matrix with entries af,..., a\. Let n : T X T —»• R TOXm 
be given by = k A ,cr{t, t')l mXrn and let p(y|X, T; k) = /p(y|W, X)p(W|T; «)dW be 

the marginal distribution over labels given instances and task variables defined by the varying- 
coefficient model. Then it holds that p(y|X, T; n) = p(y |X, T; Q, a). 
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Proposition 2 implies that performing Bayesian prediction in the varying-cocflicicnt model with the 
specified kernel function is identical to performing Bayesian inference in the hierarchical Bayesian 
multitask model. The proof is included in the appendix. In Proposition 2, entries Gt,t' of G represent 
a task similarity derived from the tree structure Q. Instead of a tree structure over tasks, feature 
vectors describing individual tasks may also be given (Bonilla et ah, 2007; Yan and Zhang, 2009). 
In this case, n(t, t') can be computed from the task features; the varying-cocflicicnt model then 
subsumes existing approaches for multitask learning with task features (see Section 3.3). 

3. Inference 

We now address the problem of inferring predictions y* for instances x*, and task variables t*. 
Section 3.1 presents exact Bayesian solutions for regression; Section 3.2 discusses approximate 
Bayesian inference for classification. Section 3.3 derives existing multitask models as special cases. 

3.1 Regression 

This subsection studies linear regression models of the form p(y|x, w) = A r (y|x T w, t 2 ). Note 
that by substituting for the slightly heavier notation p(y |x, w) = A r (y| < I > (x) T w. r 2 ), this treatment 
also covers finite-dimensional feature maps. The predictive distribution for test instance x+ with 
task variable t* is obtained by integrating over the possible parameter values w* of the conditional 
distribution that has generated value y*: 

p(y*|X,y,T,x*,t*) = J p(y*|x*, w*)p(w*|X,y, T, t*)dw*, (5) 

where the posterior over w* is obtained by integrating over the joint parameter values W that have 
generated the labels y for instances X and task variables T: 

p( w*|X, y, T, t*) = Jp{ w*|W, T, t*)p(W|X, y, T)dW. (6) 

Posterior distribution p(W|X, y, T) in Equation 6 depends on the likelihood function—the linear 
model—and the GP prior p(u>). The extrapolated posterior p(w*|W, T, t*) for test instance x* with 
task variable t* depends on the Gaussian process. The following theorem states how the predictive 
distribution given by Equation 5 can be computed. 

Theorem 3 (Bayesian Predictive Distribution) Let y = M, p(y|x, w) = AA(y|x T w, r 2 ), and 
let the kernel matrix K>r he positive definite. Let K E M nxn be a matrix with components 
kij = ty) and k E M n be a vector with components ki = xj r x*fc 7 '(tj, t*). Then , the 

predictive distribution for the varying-coefficient model defined in Section 2 is given by 

p(y*l x >y> T >x*,t*) = Af{yfip,a 2 + r 2 ) (7) 


with 


T = k T (K + r 2 I nXn ) V, 

O- 2 = xjx*/c r (t*,t*) - k T (K + r 2 I IJXn ) _1 k. 
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Before we prove Theorem 3, we highlight three observations about this result. First, the distribution 
p(y*|X, y, T, x*, t*) has a surprisingly simple form. It is identical to the predictive distribution of a 
standard Gaussian process that uses concatenated vectors (xi, ti),..., (x n , t n ) G X x T as training 
instances, labels y \,..., y n , and the product kernel function fc((xj, tj), (x^-, tj)) = xjx.jkj-(ti, tj). 

Secondly, instances xi,... ,x n ,x* € X only enter Equation 7 in the form of inner prod¬ 
ucts. The model can therefore directly be kernelized by defining the kernel matrix as K 4J = 
tj) with kernel function /c^(x 4 , Xj) = < I ) (x ) ) T< I>(x J ) where $ maps to areproduc- 
ing kernel Hilbert space. When the feature space is finite, then uj maps the tj to a finite-dimensional 
Wj and Theorem 3 implies a Bayesian predictive distribution derived from the generative pro¬ 
cess that Section 2 specifies. When the reproducing kernel Hilbert space does not have a finite 
dimension. Section 2 does no longer specify a corresponding proper generative process because 
p(wi,..., w n |T) would otherwise become infinite-dimensionally normally distributed. However, 
given the finite sample X and T, a Mercer map (see, e.g., Scholkopf and Smola, 2002, Section 
2.2.4) constitutes a finite-dimensional space M n for which Section 2 again characterizes a corre¬ 
sponding generative process. 

Thirdly and finally. Theorem 3 shows how Bayesian inference in varying-coefficient models 
with isotropic priors can be implemented much more efficiently than in general varying-coefficient 
models. Bayesian inference in varying-coefficient models in the parameter space generally involves 
matrices of size nm x nm because it needs to take the overall covariance structure into account; 
the algorithm of Gelfand et al. infers the covariance matrix under an inverse Wishart prior using a 
sliced Gibbs sampler over parameter values Gelfand et al. (2003). This makes inference impractical 
for large-scale problems. Theorem 3 shows that under the isotropy assumption, the latent parameter 
vectors wj,..., w ra can be integrated out, which results in a GP formulation in which the covariance 
structure over parameter vectors resolves to an n x n product-kernel matrix. 

Proof of Theorem 3. Let Wi r and w* r denote the ?’-th elements of vectors w, and w*, and let x ir 
and denote the r-th elements of vectors x, and x*. Let z* = (z \,..., z n . z i ,) T G M n+1 with 
Zi = xjw, and = xjw*. Because wi,..., w n , w* arc evaluations of the function uj drawn from 
a Gaussian process (Equation 1), they arc jointly Gaussian distributed and thus z\., z n , c* arc 
also jointly Gaussian (e.g., Murphy, 2012, Chapter 10.2.5). Because m is drawn from a zero-mean 
process, it holds that E [zj\ = E[^^ =1 Xi r Wi r \ = Xir^[wi r \ = 0 as well as E[z*] = 0 and 

therefore 

p(z*|X, T, x*, t*) = AA(z*|0, C) 

where C G ]R( n + 1 ) x ( n + 1 ) denotes the covariance matrix. Lor the covariances JL[z l Zj it holds that 



m m 

— ''y ^ y \ XigXjrM. [rr'c s 7i,-.y r ] 
s=l r =1 


m 

= ^2 XisXjs'E [w is Wj s ] 
s =1 

= xj X jk-J {i;.tj). 


( 8 ) 

(9) 
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In Equations 8 and 9 we exploit the isotropy of the Gaussian process prior: the covariance K[w ls Wj r ] 
is the element in row s and column r of the matrix n(ti,tj) £ M mxm obtained by evaluating the 
kernel function k:TxT-> K mxm at (t i, t j); the isotropy assumption K(t, t 7 ) = kp(t, t 7 )I mxm 
means that this matrix is diagonal with E [wi S Wj r \ = 0 for s r and Efu^ur/s] = (see 

Section 2). We analogously derive 

E [z i z i '\=x]x*kT(ti,t*), (10) 

E[z*z*] = xjx*fe r (t*,t*). (11) 

Equations 9, 10 and 11 define the covariance matrix C, yielding 

p(z*|X,T,x*,t*) = J\T ^z*|0, ^ ^ ^ )) 

where £;* = xjx*fc r (t*. t*). For y* = (yi,..., y n , y*) it now follows that 

p(y*|X,T,x*,t*)=Af(y*|oY ^ ( 12) 

The claim now follows by applying standard Gaussian identities to compute the conditional distri¬ 
bution p(y*|X, y, T, x*, t*) from Equation 12. ■ 


3.2 Classification 

The result given by Theorem 3 can be extended to classification settings with y = {0,1} by using 
non-Gaussian likelihoods p(y\z) that generate labels y 6 y given outputs z 6 lof the linear model. 


Theorem 4 (Bayesian predictive distribution for non-Gaussian likelihoods) Let y = {0,1}. Let 

p(yi|xj, w,j) be given by a generalized linear model, defined by Zi ~ Af(z\wjxi, r 2 ) and yi ~ 
p(y\zi). Let p(y* |x*,w*) be given by z * ~ A^(z|wJ"x*, r 2 ) and y* ~ p(y\z*). Let furthermore 
z = (zi,...,z n ) T £ M n . 

Let the kernel matrix Kt be positive definite, and let K £ W ixn be a matrix with compo¬ 
nents kij = tj) and k £ M n a vector with components k{ = x^x*^^, t*). Then, the 

predictive distribution for the GP model defined in Section 2 is given by 

p(y*|X,y, T,x*,t*) oc JJ p(yfiz*)J\T(z*\fi z , cr 2 )p(y|z)Af(z|0, K + r 2 I„ xn )dzd^* (13) 

with 


Mz = k T (K + r 2 I nxn ) x z, 

cj 2 = xjx*k r (t*,t*) - k T (K + T 2 I nXn ) _1 k + r 2 . 

A straightforward calculation shows that Equation 13 is identical to the predictive distribution of 
a standard Gaussian process that uses concatenated vectors (xi,ti),..., (x ra ,t n ) £ X x T as 
training instances, labels yi ,..., the product kernel /r((x z . ty), (x^-, tj)) = xjand 
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likelihood function p(y\z). For non-Gaussian likelihoods, exact inference in Gaussian processes 
is generally intractable, but approximate inference methods based on, e.g., Laplace approximation, 
variational inference or expectation propagation are available. 

Proof of Theorem 4. Rewriting p(y*|X, y, T, x*, t*) in terms of a marginalization over the vari¬ 
ables z and z* leads to: 


P{y* |X,y,T,x*,t*) = J p(y*|z*)p(z*|X, y, T, x*, t*)dz* 

= JJ p(y*|z*)p(z*|X,z,T,x*,t*)p(z|X,y,T)dzdz* 
oc JJ p(y*|z*)p(z*|X,z,T,x*,t*)p(y|z)p(z|X, T)dzdz*. 


The proof now quickly follows from Theorem 3 and derivations in the proof of The¬ 
orem 3: Equation 7 implies p(z*|X, z, T, x*, t*) = J\f(z*\fj, x , &%), Equation 12 implies 

p(z|X, T) = 7V(z|0, K + T 2 I nXn ). M 


3.3 Product Kernels in Transfer Learning 

Sections 3.1 and 3.2 have shown that inference in the varying-coefficient model is equivalent to 
inference in standard Gaussian processes with products of task kernels and instance kernels. Sim¬ 
ilar product kernels are used in several existing transfer learning models. Our results identify the 
generative assumptions that underlie these models by showing that the product kernels which they 
employ can be derived from the assumption of a varying-coefficient model with isotropic GP prior 
and an appropriate kernel function. 

Bonilla et al. (2007) study a setting in which there is a discrete set of k tasks, which are described 
by task-specffic attribute vectors 1 1 ...., t/,.. They study a Gaussian process model based on con¬ 
catenated feature vectors (x, t) and a product kernel k((x, t), (x 7 , t')) = A';t(x. x')(t, t'), where 
A^>(x. x') reflects instance similarity and k-p(t, t') reflects task similarity. Theorems 3 and 4 iden¬ 
tify the generative assumptions underlying this model: a varying-coefficient model with isotropic 
Gaussian process prior and kernel kj- generates task-specific parameter vectors in a reproducing 
Hilbert space of the instance kernel kx ; a linear model in that Hilbert space generates the observed 
labels. 

Evgeniou et al. (2005) and Alvarez et al. (2011) study multitask-learning problems in which task 
similarities arc given in terms of a task graph. Their method uses the product of an instance kernel 
and the graph-Laplacian kernel of the task graph. We will now show that, when the task graph is 
a tree, that kernel emerges from Proposition 2. This signifies that, when the task graph is a tree, 
the graph regularization method of Evgeniou et al. (2005) is the dual formulation of hierarchical 
Bayesian multitask learning, and therefore Bayesian inference for hierarchical Bayesian models 
can be carried out efficiently using a standard Gaussian process with a graph-Laplacian kernel. 

Definition 5 (Graph-Laplacian Multitask Kernel) Let Q = (T, M) denote a weighted undirected 
graph structure over a set of tasks T = {1,..., k} given by a symmetric adjacency matrix M E M fcxfc , 
where M,.j defines the positive weight of the edge between tasks i and j or M,;j = 0 if no such 
edge exists. Let D denote the weighted degree matrix of the graph, and L = D + R — M the 



graph Laplacian, where a diagonal matrix R that acts as a regularizer has been added to the de¬ 
gree matrix (Alvarez et al, 2011). The kernel function /cm,r : (X x T) x (X x T) —> M given 
by 

^M,R((x,t), (x',t')) = L^ t ,x T x', 

where is the pseudoinverse of L, will be referred to as the graph-Laplacian multitask kernel. 


The following proposition states that the graph-Laplacian multitask kernel is equal to the kernel 
that emerges in the dual formulation of hierarchical Bayesian multitask learning (Definition 1). 

Proposition 6 Let Q = (T, A) denote a directed tree structure given by an adjacency matrix A. 
Let a £ be a vector with entries o±,..., <jf.. Let B £ M. kxk denote the diagonal matrix with 
entries 0, af 2 ,..., of 2 , let R £ M. kxk denote the diagonal matrix with entries of 2 , 0,..., 0, let 
M = BA + (BA) t , and let t r ) be defined as in Proposition 2. Then 

&M,R((x,t), (x',t')) = feA,o-(t,t , ) xT X / . 


Note that in Proposition 6, BA is an adjacency matrix in which an edge from node i to node j is 
weighted by the respective precision oj 2 of the conditional distribution (Equation 4); adding the 
transpose yields a symmetric matrix M of task relationship weights. The precision of 2 of the root 
node prior is subsumed in the regularizer R. The proof is included in the appendix. 


4. Empirical Study 

In this section, we study the efficiency and accuracy of different varying-coefficient models and 
baselines for geospatial and temporal regression and classification problems. We focus on the prob¬ 
lems of predicting real estate prices and monthly housing rents. 

For real estate price prediction, we acquire records of real-estate sales in New York City for 
sales dating from January 2003 to December 2009 in June 2013 through the NYC Open Data ini¬ 
tiative 1 . Input variables include the floor space, plot area, property class (such as family home, 
residential condominium, office, or store), date of construction of the building, and the number of 
residential and commercial units in the building. After binarization of multi-valued attributes there 
are 94 numeric attributes in the data set. For regression, the sales price serves as target variable y; 
we also study a classification problem in which y is a binary indicator that distinguishes between 
transactions with a price above the median of 450,000 dollars from transactions below it. Date and 
address for every sale arc available; we transform addresses into geographical latitude and longitude 
using an inverse geocoding service based on OpenStreetMap data. We encode the sales date and 
geographical latitude and longitude of the property as task variable t £ M 3 . 

Price and attributes in sales records vary widely; for instance, prices range from one dollar 
to four billion dollars, and the floor space from one square foot to fourteen million square feet. A 
substantial number of records contain either errors or document transactions in which the valuations 
do not reflect the actual market values: for instance, Manhattan condominiums that sold for one 
dollar, and one-square-foot lots that sold for massive prices. In order to filter most off-market 
transactions by means of a simple policy, we only include records of sales within a price range of 
100,000 to 1,000,000 dollars, a property area range of 500 to 5,000 square feet, and a land area 

1.https://nycopendata.socrata.com/. 
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range of 500 to 10,000 square feet. Approximately 80% of all records fall into these brackets. 
Additionally, we remove all records with missing values. After preprocessing, the data set contains 
231,708 sales records. We divide the records, which span dates from January 2003 to December 
2009, into 25 consecutive blocks. Models are trained on a set of n instances sampled randomly 
from a window of five blocks of historical data and evaluated on the subsequent block; results arc 
averaged over all blocks. 

For rent prediction, we acquire records on the monthly rent paid for privately rented apartments 
and houses in the states of California and New York from the 2013 American Community Survey’s 
ASC public use microdata sample files 2 . Input variables include the number of rooms, number 
of bedrooms, the duration for which the contract has been running, the construction year of the 
building, the type of building (mobile home, trailer, or boat; attached or detached family house; 
apartment building), and variables that describe technical facilities ( e.g ., variables related to internet 
access, type of plumbing, and type of heating). After binarization of multi-valued attributes there 
are 24 numerical attributes in the data. We study a regression problem in which the target variable y 
is the monthly rent, and a classification problem in which y is a binary indicator that distinguishes 
contracts with a monthly rent above the median of 1,200 dollar's from those with a rent below the 
median. For each record, the geographical location is available in the form of a public use microdata 
area (PUMA) code 3 . We translate PUMA codes to geographical latitude and longitude by associ¬ 
ating each record with the longitude-latitude-centroid of the corresponding public use microdata 
area; these geographical latitudes and longitudes constitute the task variable tgl 2 . We remove all 
records with missing values. The preprocessed data sets contain 36,785 records (state of California) 
and 17,944 records (state of New York). Models arc evaluated using 20-fold cross validation; in 
each fold, a random subset of n training instances is sampled randomly from the respective training 
fold. 

We study the varying-coefficient model with isotropic GP prior introduced in Section 2 with 
a Matern kernel ky-(t, t 7 ). Predictions arc obtained from Theorem 3, using either a linear or also 
a Matern kernel function Ayfx. x') (denoted by isoVCM hn and isoVCM mat , respectively). We 
compare with the varying-coefficient model with nonisotropic GP prior by Gelfand et al. (2003), 
in which the covariances are inferred from data (denoted by Gelfand). Furthermore, we compare 
with the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008) that infers point 
estimates of model parameters. We study this model using a linear feature map (Fan & Zhang lm ) 
and a nonlinear feature map constructed from a Matern kernel (Fan & Zhang mat ). Fan and Zhang 
(2008) do not regularize parameter estimates in their original model, we added an t^-regularizer as 
this improved predictive performance. 

We finally compare against an iid baseline that assumes that p(y |x) is constant in t, implemented 
by a standard Gaussian process with a linear ( GP 1 ™) or Matern (GP™ at ) kernel, and with a standard 
Gaussian process that simply concatenates instance and task attribute vectors into vectors (x, t) 
(denoted GP la \ and GP™ 3 /). 

For classification, we use logistic likelihood functions in our model (Theorem 4), and also in the 
GP baselines and the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008). 
All kernel parameters, as well as the observation noise parameter r of Theorem 3 and the obser¬ 
vation noise parameters of the standard GP models are tuned according to marginal likelihood on 

2. http://factfinder.census.gov/faces/affhelp/jsf/pages/metadata.xhtml?lang= 

en&type=document&id=document.en.ACS pums csv 2013#main content. 

3. https://www.census.gov/geo/reference/puma.html. 
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Figure 2: Execution time of isoVCM and reference methods over training set size n. 


the training data. The regularization parameter of the kernel-local smoothing varying-coefficient 
model and its kernel parameter h (see Fan and Zhang, 2008) are tuned on the training data by 
cross-validation. The isoVCM model and all GP baselines are implemented based on the GPML 
Gaussian process toolbox (Rasmussen and Nickisch, 2010). Inference is earned out using the FITC 
approximation based on a low-rank approximation to the exact covariance matrix with 1,000 ran¬ 
domly sampled inducing points (Snelson and Ghahramani, 2005), and using Laplace approximation 
for classification. 

First, we compare the execution time of the GP inference that results from Theorem 3 with 
the execution time of the primal inference procedure of Gelfand et al. (2003) and the execution 
time of the kernel-local smoothing varying-coefficient model of Fan and Zhang (2008). Figure 2 
shows the execution time for model training and prediction on one block of test instances in the 
real estate price prediction task as a function of the training set size n (CPU core seconds, Intel 
Xeon 5520, 2.26 GHz). For the model of Gelfand et al., the most expensive step during inference 
is computation of the inverse of a Cholesky decomposition of an nm x nm matrix, which needs 
to be performed within each Gibbs sampling iteration. Figure 2 shows the execution time of 5,000 
iterations of this step (3,000 burn-in and 2,000 sampling iterations, according to Gelfand et al., 
2003), yielding a lower bound on the overall execution time. An experimental run with Bayesian 
inference for nonisotropic GP priors requires 230 CPU core days even for 100 training instances; 
as matrix inversion scales nearly cubically in n, it is impractical for this application. We therefore 
exclude this method from the remaining experiments. By contrast, full Bayesian inference in our 
GP model takes less than a second. The execution time of the kernel-local smoothing varying- 
coefficient model by Fan and Zhang (2008) substantially differs for the regression and classification 
task. In this model, separate point estimates of model parameters have to be inferred for each test 
instance, for which a separate optimization problem needs to be solved. For regression, efficient 
closed-form solutions for parameter estimates are available, while for classification more expensive 
numerical optimization is required (Fan and Zhang, 2008). 

In all subsequent experiments, each method is given 30 CPU core days of execution time; ex¬ 
periments are run sequentially for increasing number n of training instances and results are reported 
for values of n for which the cumulative execution time is below this limit. 
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Figure 3: Mean absolute error for predicting real estate prices in New York City (left) and mean 
zero-one loss for classifying real estate transactions (right) over training set size n. Error 
bars indicate the standard error. 


Figure 3 shows the mean absolute error for real estate price predictions (left) and the mean zero- 
one loss for classifying sales transactions (right) as a function of training set size n. For regression, 
Fan & Zhang Im and Fan & Zhang mat partially completed the experiments; for classification, both 
methods did not complete the experiment for the smallest value of n. All other methods completed 
the experiments within the time limit. For regression, we observe that isoVCM 1 " 1 is substantially 
more accurate than GP 1 ™, GP^" t , and Fan & Zhang 1111 ; isoVCM mat is more accurate than GP™ at and 
GP™f with p < 0.01 for all training set sizes according to a paired t- test. Significance values of 
paired t-test comparing isoVCM mat and Fan & Zhang mat fluctuate between p < 0.01 and p < 0.2 
for different n, indicating that isoVCM mat is likely more accurate than Fan & Zhang mat . For clas¬ 
sification, isoVCM hn substantially outperforms GP and GP^\; isoVCM mat outperforms GP™ at 
and GP™!' (p < 0.01 for n > 125). 

Figure 4 shows the mean absolute error for predicting monthly housing rent (left) and the mean 
zero-one loss for classifying rental contracts (right) for rental contracts in the state of California (up¬ 
per row) and the state of New York (lower row) as a function of training set size n. Fan & Zhang 1111 
completed the regression experiments within the time limit and partially completed the classification 
experiment; Fan & Zhang Jnat partially completed the regression experiment but did not complete the 
classification experiment for the smallest value of n. We again observe that isoVCM mat yields 
the most accurate predictions for both classification and regression problems; isoVCM 1111 always 
yields more accurate predictions than Fan & Zhang lm and more accurate predictions than GP^\ for 
training set sizes larger than n = 1000. 

5. Discussion and Related Work 

Varying-coefficient models reflect applications in which a conditional distribution of y given x 
is a function of task variables t. The task variables can, for instance, be continuous, discrete, 
or nodes in a tree—as in hierarchical Bayesian multitask learning. The functional dependency 
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Figure 4: Mean absolute error for predicting monthly housing rents (left) and mean zero-one loss 
for classifying rental contracts (right) in the states of California (upper row) and New 
York (lower row) over training set size n. Error bars indicate the standard error. 


between the conditional distribution of the output given the input and the task variables can be 
modeled with a GP prior. Theorem 3 shows that, for isotropic GP priors, Bayesian inference in 
varying-coefficient models can be carried out efficiently by using a standard Gaussian process with 
a kernel that is defined as the product of a task kernel and an instance kernel. This result clarifies 
the exact modeling assumptions required to derive the multitask kernel of Bonilla et al. (2007). 
This result also highlights that Bayesian inference for hierarchical Bayesian learning can be earned 
out efficiently by using a standard Gaussian process with graph-Laplacian kernel (Evgeniou et al., 
2005). 

Product kernels play a role in other multitask learning models. In the linear coregionalization 
model, several related functions are modeled as linear combinations of Gaussian processes; the co- 
variance function then resolves to a product of a kernel function on instances and a matrix of mixing 
coefficients (Journel and Fluijbregts, 1978; Alvarez et al., 2011). A similar model is studied by Wang 
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et al. (2007) in the context of style-content separation in human locomotion data; here mixing co¬ 
efficients are given by latent variables that represent an individual’s movement style. Zhang and 
Yeung (2010) study a model for learning task relationships, and show that under a matrix-normal 
regularize!' the solution of a multitask-regularized risk minimization problem can be expressed using 
a product kernel. Theorem 3 can be seen as a generalization of their result in which the regularize!' 
is replaced by a prior over functions, and the regularized risk minimization perspective by a fully 
Bayesian analysis. 

Non-stationarity can also be modeled in Gaussian processes by assuming that either the resid¬ 
ual variance (Wang and Neal, 2012), or the length scale of the covariance function (Schmidt and 
O’ Hagan, 2003), or the amplitude of the output (Adams and Stegle, 2008) arc input-dependent. 
The varying-cocllicicnt model differs from these models in that the source of non-stationarity is 
observed in the task variable. 

In the domain of real estate price prediction, the dependency between property attributes and the 
market price changes continuously with geographical coordinates and time. We observe that primal 
Bayesian inference in varying-coclhcicnt models with nonisotropic GP priors is all but impractical 
in this domain, while for isotropic GP priors, inference based on Theorem 3 is more efficient by 
several orders of magnitude. Empirically, we observe that the I inear and kernelized isoVCM models 
predict real estate prices and housing rents more accurately over time and space than kernel-local 
smoothing varying-cocllicicnt models, and arc also more accurate than 1 inear and kernelized models 
that append the task variables to the attribute vector or ignore the task variables. 
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Appendix 

Proof of Proposition 2. 

The marginal p(y|X, T; k) is defined by the generative process of drawing u> ~ QV( 0, k), 
evaluating u for the k different tasks to create parameter vectors ur(l),..., u{k), and then draw¬ 
ing y l ~ p(y|xj, ar(tj)) for i = 1 The marginal p(y|X, T; Q, a) is defined by the 

generative process of generating parameter vectors wi,..., w/,. according to Equations 3 and 4 
in Definition 1, and then drawing yi ~ p(y|xj,w ti ) for i = 1 Here, the observa¬ 

tion models p(y|xj,w ti ) and p(y\jc,. ccft.,.)) are identical. It therefore suffices to show that 

p(u( 1), • • |k) = p( Wi,.. .,w k \g, cr). 

The distribution p( wy,..., w k\G, cr) can be derived from standard results for Gaussian graphi¬ 
cal models. Let W E M fcxm denote the matrix with row vectors wj ,..., wj, and let vec (W T ) E 
R km denote the vector of random variables obtained by stacking the vectors wj,..., w/, on top of 
another. According to Equations 3 and 4, the distribution over the random variables within vec(W T ) 
is given by a Gaussian graphical model ( e.g ., Murphy (2012), Chapter 10.2.5) with weight matrix 
A <g) I mxm G M fcmxfcm and standard deviations cr 0 l m , where l m E M m is the all-one vector. It 
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follows that the distribution over vec(W T ) G R km is given by 


p(vec(W T )|g,er) = AA(vec(W T )|0, S) 


with 


= {Ikmxkm ~ A <S> Imxm) 1 diag(<7 <g> l m ) 2 

(1 _ A T (Q. T \-l 

VX km Jr% - <y - a -mXm/ 

(see Murphy (2012), Chapter 10.2.5), where diag(<x <g) l m ) G jjfanxfcm c [ cnotcs q lc diagonal matrix 
with entries a (g> l m . 

The distribution p(us(l),..., u)(k)\n) is given directly by the Gaussian process defining the 
prior over vector-valued functions u> : T — >• M m (see Equation 1). Let f l G M fcxm denote the 
matrix with row vectors io(l) T , ..., us(k) J , then the Gaussian process prior implies 

p(vec(fl J )\K) = AA(vec(fJ) T |0, G <g> I mxm ) 

(see, e.g., Alvarez et al. (2011), Section 3.3). A straightforward calculation now shows G<g)I, nxm = 
X and thereby proves the claim. ■ 


Proof of Proposition 6. In the following we use the notation that is introduced in Proposition 2 
and Definition 5. We first observe that by the definition of the graph Laplacian multitask kernel it is 
sufficient to show that G = iA Since the matrix G is invertible, this is equivalent to G 1 = L. 

We prove the claim by induction over the number of nodes \T\ in the tree Q. If \T\ = 1, then 
we have A = 0, D = 0, R = cr^ 2 and M = 0. This leads to 

G 1 = (I - A t )(t 7 2 (I - A) = erf 1 = D + R — M = L 


and proves the base case. Let us now assume that we have a tree Gk with \T\ = k > 1 nodes. Let 
t be a leaf of this tree and t' shall be its unique parent. Suppose we have t ' = i and w.l.o.g. we 
assume that t = k. Let furthermore Gk-\ be the tree which we get by removing the node k and 
its adjacent edge from the tree Gk- Let A/,, and A*._i denote the adjacency matrices and D/ and 
D/,._i the degree matrices of Gk and Gk- 1 - Let <t/,. € be the vector with entries a\..... 07 ,., and 
(Tk-i G BA 1 be the vector with entries a \,..., 1 ■ Let R/ G R kxk denote the diagonal matrix 

with entries a^ 2 , 0,..., 0, and R*._i G jjfc-ixfc-i t h e diagonal matrix with entries crj -2 , 0,..., 0. 
Let B /. G R kxk denote the diagonal matrix with entries 0, a^ 2 , ■ ■ ■, <rj7 2 and B/,_ | G R k ~ lxk 1 
the diagonal matrix with entries 0, o\j" 2 ,..., ay. 2 ,. Let = B& A fc + (B A :A fc ) T and M fc _i = 
Bfc_iAfc_i T (B^—iA^— 1 ) . Let — L)^ T R& AI^ and L k—i — 1 T R &—1 IVI^—i. 

In the following, we write diag(v) to denote a diagonal matrix with entries v. We then have 


Afc 



where e = ( 0 ,..., 0 , 1 , 0 ,..., 0) 1 
i—1 
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is the i th (n — l)-dimensional unit vector. Using this notation we can write 


G^Hl-A^diagKr^I-A, 


- 2 / 


l-H 

1 

> 

r 


( diag(cr A ._i) 2 

° ^ 

( I-A*_! 

—e 

-e 1 


V o 

a k J 

V 0 

1 


L fc _i + <T k 2 ee T 

~°k 2e ' 




In the last line we applied the induction hypothesis to the tree Qk-i- Using the definitions of L, D, 
R and M, we can easily finish the proof: 


1-1 

*k 


Dfc_i + Rfc_i — + a k J ee T 


V ~ a k 


( D fc _i + crf 2 ee J 

0 

- V 0 

a k 2 

= Dfc + Kk - M/e 

= L/e. 



~cr, 2 e 




CT,. 


+ 


-R'fc—1 


i-i 

( M fc _! 

a k 2e 

0 



0 


This proves the claim. 
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